In this section I import the data and prepare it for analysis.
Code
```{r setup}#| warning: false#| message: falselibrary(tidyverse)library(plotly)library(dplyr)library(tidyr)library(knitr)library(table1) #Create HTML Tables of Descriptive Statistics https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html#library(OMTM1) #https://github.com/schildjs/OMTM1/library(Hmisc)library(rms) # Regression Modeling Strategies by Frank https://cran.r-project.org/web/packages/rms/index.htmllibrary(modelsummary) #Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready https://cran.r-project.org/web/packages/modelsummary/index.htmllibrary(scales) # The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. https://scales.r-lib.orglibrary(viridis) #colorslibrary(cowplot) #allows me to use plotgridlibrary(gridExtra) #adding tables to plotslibrary(visdat) #shows missing datalibrary(GGally) #makes pairs plotslibrary(sandwich) #for robust standard errorslibrary(gtsummary) #makes the tables look nicerlibrary(glmnet) #for running LASSOlibrary(mice)#if you want to rerun this in your computer, change the directory to where you want to work on your machine locally (do not upload data to github!)setwd("/Users/lisalevoir/BIOS7351_Collab/data_project2") #this line I would need to run in the consoleknitr::opts_knit$set(root.dir ="/Users/lisalevoir/BIOS7351_Collab/github/BIOS_Collaboration/project_2_analysis") #now I set global options for knitting, I also had to toggle global options > R Markdown > evaluate chunks in current directory#THIS IS WHERE YOU'D CHANGE THE FILE PATHS TO WHERE YOU STORED THE DATA LOCALLY#import the datadat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/combined_data_203.csv")#to compare that the merging went as expectedVUMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_VUMS.csv")HMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_HMS.csv")UVAdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_UVA.csv")```
1.0.1 Inclusion/Exclusion
Criteria to exclude students who most likely took a scored exam:
Any PhD students (n = 2)
Any 5th year program students (n = 19)
M4 students at Vanderbilt (n = 5)
Students who did not complete either Step survey (n = 2)
Students who specifically stated they took a scored Step 1 (n=1)
[1] "uworld_percent_step1_1" "uworld_percent_step1_2"
[1] "amboss_percent_step1_1" "amboss_percent_step1_2"
[1] "length_step1_1" "length_step1_2"
[1] "practicetest_step1_1" "practicetest_step1_2"
[1] "full_test_practice_step1_1" "full_test_practice_step1_2"
[1] "push_step1_1" "push_step1_2"
[1] "push_practice_test_step1_1" "push_practice_test_step1_2"
[1] "push_nbme_practice_score_step1_1" "push_nbme_practice_score_step1_2"
[1] "push_uw_practice_score_step1_1" "push_uw_practice_score_step1_2"
[1] "final_nbme_practice_score_step1_1" "final_nbme_practice_score_step1_2"
[1] "final_uw_practice_score_step1_1" "final_uw_practice_score_step1_2"
[1] "score_step1_1" "score_step1_2"
[1] "other_resources_step1_1" "other_resources_step1_2"
[1] "other_step1_1" "other_step1_2"
[1] "changes_step1_1" "changes_step1_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"
[1] "uworld_percent_step2_1" "uworld_percent_step2_2"
[1] "amboss_percent_step2_1" "amboss_percent_step2_2"
[1] "length_step2_1" "length_step2_2"
[1] "practicetest_step2_1" "practicetest_step2_2"
[1] "full_test_practice_step2_1" "full_test_practice_step2_2"
[1] "practice_score_step2_1" "practice_score_step2_2"
[1] "practice_test_step2_1" "practice_test_step2_2"
[1] "score_step2_1" "score_step2_2"
[1] "target_score_step2_1" "target_score_step2_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"
There were 203 participants after the the clients did data exclusion by hand, as compared to the total survey size which was 182 unique individuals included in the step 1 data.
These are the record IDs that were excluded by the collaborator: VUSM 23, VUSM 60, HMS 37, HMS 41, HMS 49, UVASOM 33, UVASOM 80, UVASOM 84.
For our analysis, I originally included everyone from the 3 school specific data sets. Then, we flagged individuals to remove and I remade the source data set. These are the records IDs that we decided to exclude:
VU record ids: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64
HMS record ids: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61
UVA record ids: 33, 47, 80, 81, 83
Below are tables for Step 1 and Step 2 response inclusion by school:
Code
kable(table(step1_complete$schoolid), caption ="Number of Step 1 responses included by school")
Number of Step 1 responses included by school
Var1
Freq
HMS
50
UVa
79
VU
53
Code
kable(table(step2_complete$schoolid), caption ="Number of Step 2 responses included by school")
Number of Step 2 responses included by school
Var1
Freq
HMS
39
UVa
65
VU
34
Notice that unfortunately for Step 2 analysis, we had to drop 44 individuals who did not report a step 2 score. These raw counts and frequencies of people who did not give a step 2 score (and are therefor not eligible for analysis) are listed by institution below.
Code
# describing the missingnesskable(table(drop_these_with_no_outcome$schoolid), caption ="Number of missing Step 2 scores by institution")
Number of missing Step 2 scores by institution
Var1
Freq
HMS
11
UVa
14
VU
19
Code
num <-as.vector(table(drop_these_with_no_outcome$schoolid))denom <-as.vector(table(step2_complete$schoolid))nonresponse_freq <-setNames(c(round(num/denom, 3)), c(names(table(step2_complete$schoolid))))kable(nonresponse_freq, caption ="frequency of missing step 2 scores by instition")
frequency of missing step 2 scores by instition
x
HMS
0.282
UVa
0.215
VU
0.559
We also find it helpful to visually profile the missingness in this graphic below. Some of this missingness is due to questions being hierarchical.
Note, we plan to use the robust/sandwich variance estimator for regression models. One inclusion criteria is that the outcome variable (“Y”) must be available for a subject to be included in the analysis question (ie. if they did not report a step 2 score, we won’t perform relevant step 2 analysis on them).
step 1 first step 2 first only step 1 only step 2
71 67 0 0
Code
table(step2_complete$which_exam_first)
1 2
71 67
Code
exam_order_mod <-lm(formula = score_step2 ~ which_exam_first + school_factor, data = step2_complete)summary(exam_order_mod) #use this to get the betas, but get the standard error and p values from the sandwich command
Call:
lm(formula = score_step2 ~ which_exam_first + school_factor,
data = step2_complete)
Residuals:
Min 1Q Median 3Q Max
-112.100 -5.178 2.767 8.761 21.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 261.3383 5.2033 50.225 <2e-16 ***
which_exam_first 0.8952 4.5256 0.198 0.844
school_factorUVA -4.0290 5.0342 -0.800 0.425
school_factorVUMS -0.9797 3.5128 -0.279 0.781
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.69 on 134 degrees of freedom
Multiple R-squared: 0.01018, Adjusted R-squared: -0.01198
F-statistic: 0.4593 on 3 and 134 DF, p-value: 0.7112
Code
sqrt(diag(sandwich(exam_order_mod))) #this gives the sandwich variance now set to be the standard error
p_interim <-pt(abs(exam_order_mod$coefficients)/sqrt(diag(sandwich(exam_order_mod))) , df =134) options(scipen =999)1- p_interim #here are the p values with using the robust se
Based on the model output (p values), there doesn’t appear to be any signifigant associations between exam order and the score for Step 2. Since the R^2 value is essentially 0, I conclude that there was no effect of exam order on step 2 scores.
Again, we cannot analyze Step 1 scores since all respondents reported passing.
Based on our SAP, if there are any covariates with more than 30% of responses missing, we will drop that variable or populate it with 0, depending on context. For example, the percent of Amboss questions completed will be filled with 0 for people who didn’t answer that they used Amboss since it seems safe to assume they didn’t complete any of the Amboss questions. For people who did select that they used Amboss but didn’t answer a percentage will be treated as missing
If less than 30% are missing, I may consider performing bootstrap sampling of known values to replace missing values. After accounting for missingness, I will assess for co-linearity of the predictors (ie. correlation) using VIF. If there is high co-linearity, we will use LASSO to perform variable selection. If there is no evidence of concerning levels of colinearity, I will proceed with linear regression.
What actually ended up happening is that I did imputation using the mice package to account for missing data, then tried running a model but it had too high of VIF values for the length of time studying since this was a factor variable and there were <5 observations for multiple categories. Due to this sparsity of data in some categoties, it was not as advisable to include this predictor as a categorical variable (see bar plot) so that’s why we decided to change this variable to be continuous. Then there was no longer an issue with VIF values and therefore no need to run LASSO. While changing a predictor to be continous does impose an additional assumption about the linear relationship between amount of time studying, we felt it was a reasonable assumption to impose since it allows us to have one summary coefficient and allows us to “borrow information” across levels of the data.
I will keep the previous code here to show thought process and you’re welcome to uncomment/run it if you’d like to see the other models, but they are not what we’re using in the final report for the reasons listed above.
Code
#here I am creating the cleaned Uworld and Amboss variables. if they did not check that they used it, the percentage is set to 0.# if they did use it then the next ifelse checks if there is an NA where there should be a percent# if there is an NA where there should be a percent, we maintain this NA. If there is not, then we can report the percentage.# same process for uworldstep2_complete <- step2_complete %>%mutate(amboss_clean =ifelse(I.amboss ==0, 0, ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, NA, step2_complete$amboss_percent_step2)),uworld_clean =ifelse(I.uworld ==0, 0, ifelse(is.na(step2_complete$uworld_percent_step2 ) ==TRUE, NA, step2_complete$uworld_percent_step2)))#inspecting percent missing, it seems like most responses are now complete enough for analysisstep2_complete$amboss_percent_step2 <-ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, 0, step2_complete$amboss_percent_step2)class(step2_complete$practicetest_step2) <-"integer"step2_complete[,"on_target"] <-factor(step2_complete$target_score_step2, levels =c(1,2,3), labels =c("at target", "above target", "below target"))# I looked into the difference between practice_test_step2 (the final practice test I took before my exam was... 8 options) and practicetest_step2 (text response of how many practice tests did you take before step 2).# I decided to use practicetest_step2 (text response of how many practice tests did you take before step 2) but extract and recode it as numeric belowstep2_complete[, "practice_test_2_clean"] <-ifelse(is.na(step2_complete$practicetest_step2) ==TRUE, NA, substr(step2_complete$practicetest_step2, start =1, stop =1))step2_complete[, "number_of_practice_tests"] <-as.numeric(step2_complete$practice_test_2_clean) #I want to include the number of practice tests as just one numeric covariate, not coded as a factor like it is in practice_test_2_cleanstep2_complete$full_practice_test <-factor(step2_complete$full_test_practice_step2 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step2_complete$simulate_full_practice <-ifelse(step2_complete$full_test_practice_step2 <=2, 1, 0)step2_complete$simulate_full_practice <-factor(step2_complete$simulate_full_practice, levels =c(0, 1), labels =c("No", "Yes"))step2_complete$length_step2 <-factor(step2_complete$length_step2 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))######## profile missingness in the step 2 data and addressstep2_profile <- step2_complete %>%relocate(c(score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor), .before = schoolid)percents_missing <-round(colSums(is.na(step2_profile))/nrow(step2_profile), 3)*100kable(percents_missing, caption ="Percent missing observations for pooled Step 2 survey")
Percent missing observations for pooled Step 2 survey
x
record_id
0.0
score_step2
0.0
uworld_clean
1.4
amboss_clean
9.4
length_step2
1.4
simulate_full_practice
0.0
practice_score_step2
29.7
number_of_practice_tests
5.1
school_factor
0.0
schoolid
0.0
exam_order
0.0
uworld_percent_step2
1.4
amboss_percent_step2
0.0
practicetest_step2
5.1
full_test_practice_step2
0.0
practice_test_step2
29.0
target_score_step2
0.0
I.uworld
0.0
I.firstaid
0.0
I.anki
0.0
I.sketchy
0.0
I.amboss
0.0
I.pathoma
0.0
I.boardsbeyond
0.0
I.other
0.0
order_factor
0.0
which_exam_first
0.0
on_target
0.0
practice_test_2_clean
5.1
full_practice_test
0.0
Multiple linear regression with:
Y = Step 2 score (from “What was your 3-digit score on the official Step 2 exam?”)
X1 = % UWorld (recoded from “Approximately what % of UWorld did you complete during your protected study period for step 2?”)
X2 = % Amboss (recoded from “Approximately what % of the Amboss question bank did you complete during your Step 2 protected study period?”)
X3 = length study (factor from “How long did you study for Step 2 during a protected study period?”)
X4 = number of practice tests (recoded from a text answer from “How many total practice tests did you take before Step 2?”)
X5 = full test day (yes/no code as binary from “Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?”)
X6 = final practice score (What was your 3 digit score on the final practice test you took before the Step 2 exam?)
Z = School (need to adjust for this in order to remove any influence)
Careful to note that not all cases are complete - for example there are 399 responses in the complete step 2 dataset, of which for the number of practice tests taken, 108 are missing and 291 have a response recorded.
Below I report the model results, sandwich variance, and VIF for 3 different versions of the step 2 scores model. The first is essentially a complete case analysis, then I run a model on imputed data, and then I run LASSO to select predictors on the imputed data and use the LASSO selected predictors to create the 3rd model. After that, I show descriptive statistics for predictors and the outcome that was included in the model.
Code
# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 2 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$length_step2)), " missing = ", sum(is.na(step2_complete$length_step2))))
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
This plot shows that there is an issue with the number of participants who reported 1-2 weeks or more than 8 weeks so it would be advantageous to “borrow information” across levels by coding length of study as a continous variable, rather than as a categorical factor.
2.0.1 DO NOT USE THIS PART, THIS WAS A PART OF OUR ANALYSIS THOUGTHS BUT NOT WHAT WE ENDED UP GOING WITH DUE TO DATA AVAILABILITY
Code
mod_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_complete) #summary(mod_step2_scores) # sqrt(diag(sandwich(mod_step2_scores))) #this gives the sandwich variance now set to be the standard error# p_interim <-pt(abs(mod_step2_scores$coefficients)/sqrt(diag(sandwich(mod_step2_scores))) , df = 72) # 1 - p_interim #here are the p values with using the robust se# vif(mod_step2_scores)
Now showing the same the model with imputed missing values.
Code
## model variables were uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor# with missingness in uworld_clean, amboss_clean, length_step2, practice_score_step2, and number_of_practice_testsstep2_impute <- step2_complete %>%select( score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor )set.seed(123123) #so the results are reporoduciblestep2_impute = mice::mice(step2_impute,m=1,method='sample') %>% complete #better way to impute according to Jeffrey
#this is the number of missingnesskable(colSums(is.na(step2_complete[,c("uworld_clean", "amboss_clean", "length_step2", "simulate_full_practice", "practice_score_step2", "number_of_practice_tests", "school_factor")])), caption ="Missingness before imputation")
Missingness before imputation
x
uworld_clean
2
amboss_clean
13
length_step2
2
simulate_full_practice
0
practice_score_step2
41
number_of_practice_tests
7
school_factor
0
Missingness after imputation:
Code
colSums(is.na(step2_impute)) #missingness after imputation - should all be 0's
# here is are two ways to confirm the change was as expected (uncomment to see them in your console):# cbind(step2_impute$length_step2, step2_complete$length_step2)# cbind(step2_impute$practice_score_step2, step2_complete$practice_score_step2)mod_impute_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_impute) #summary(mod_impute_step2_scores)#calculating the p values using robust standard error:sqrt(diag(sandwich(mod_impute_step2_scores))) #this gives the sandwich variance now set to be the standard error
p_interim <-pt(abs(mod_impute_step2_scores$coefficients)/sqrt(diag(sandwich(mod_impute_step2_scores))) , df =114) 1- p_interim #here are the p values with using the robust se
#mod_impute_step2_scores %>% tbl_regression() #this makes a cleaner looking summary table
Here is where I run the LASSO (WHICH WE ARE NOT INCLUDING IN THE FINAL REPORT):
Code
## Lasso which we are not using anymoreset.seed(123123)my_design_mat <-model.matrix(score_step2~uworld_clean+amboss_clean+as.factor(length_step2)+as.factor(simulate_full_practice)+practice_score_step2+number_of_practice_tests+as.factor(school_factor),step2_impute)cv.glmnet(my_design_mat[,-1],step2_impute$score_step2,family ='gaussian',standardize = T,type.measure ='mse', lambda =seq(1e-2,1e+1,len=1e+4)) # based on this output, we can build the model based on 0.262
Call: cv.glmnet(x = my_design_mat[, -1], y = step2_impute$score_step2, lambda = seq(0.01, 10, len = 10000), type.measure = "mse", family = "gaussian", standardize = T)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.978 9031 172.5 66.88 4
1se 10.000 1 212.7 104.52 1
Code
model =glmnet(my_design_mat[,-1],step2_impute$score_step2,family ='gaussian',standardize = T,lambda =0.262)coef(model)
#now I need to make a model with a more limited set of predictors (this will stay the same since we set the seed)#recode length step 2 to have indicatorsstep2_LASSO <- step2_impute %>%mutate(study1.2.weeks =case_when(length_step2 =="1-2 weeks"~1, TRUE~0), study5.6.weeks =case_when(length_step2 =="5-6 weeks"~1, TRUE~0), study7.8.weeks =case_when(length_step2 =="7-8 weeks"~1, TRUE~0), study8.plus.weeks =case_when(length_step2 =="more than 8 weeks"~1, TRUE~0))## here is where I would run a model with the selected predictorsmod_LASSO_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + study1.2.weeks + study5.6.weeks + study7.8.weeks + study8.plus.weeks + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_LASSO) #careful, now the reference group absorbed in the intercept contains both people who have studied for less than 1 week AND people who have studied for 3-4 weeks## then make a summary of that model's output and the new robust se and p values# summary(mod_LASSO_step2_scores)# vif(mod_LASSO_step2_scores)# # #here are the p values# sqrt(diag(sandwich(mod_LASSO_step2_scores))) #this gives the sandwich variance now set to be the standard error# p_interim <-pt(abs(mod_LASSO_step2_scores$coefficients)/sqrt(diag(sandwich(mod_LASSO_step2_scores))) , df = 72) # 1 - p_interim #here are the p values with using the robust se
This is using the imputed values from the step 2 data
Code
#modify step2_impute dataset to modify length step 2 to be middle ranges of the values * note I am using the step2_impute dataset which is from the mice packagestep2_impute <- step2_impute %>%mutate(numeric_length_step2 =case_when(length_step2 =="less than 1 week"~0.5, length_step2 =="1-2 weeks"~1.5, length_step2 =="3-4 weeks"~3.5, length_step2 =="5-6 weeks"~5.5, length_step2 =="7-8 weeks"~7.5, length_step2 =="more than 8 weeks"~8.5))mod2_cont <-lm(score_step2 ~ uworld_clean + amboss_clean + numeric_length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_impute) summary(mod2_cont) #use this to get the betas, but get the standard error and p values from the sandwich command
p_interim <-pt(abs(mod2_cont$coefficients)/sqrt(diag(sandwich(mod2_cont))) , df =72) options(scipen =999) # makes the p values easier to read1- p_interim #here are the p values with using the robust se
Exporting the data to post on Sharepoint (careful, right now this will load them on github if you uncomment these lines). Nick can use these to make the data dictionary.
There may not be sufficient data on this since only 20 people responded that they decided to push back Step 1.
Code
step1_complete$full_practice_test <-factor(step1_complete$full_test_practice_step1 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step1_complete$simulate_full_practice_1 <-ifelse(step1_complete$full_test_practice_step1 <=2, 1, 0)step1_complete$simulate_full_practice_1 <-factor(step1_complete$simulate_full_practice_1, levels =c(0, 1), labels =c("No", "Yes"))dt_q3 = step1_complete %>%mutate(amboss_clean =if_else(I.amboss>0,amboss_percent_step1,0),uworld_clean =if_else(I.uworld>0,uworld_percent_step1,0),numeric_length_step1 =2*(length_step1-1)-0.5,push_step1 = push_step1-1) %>%left_join(step2_complete %>%select(record_id,schoolid, practice_score_step2)) %>%filter(!is.na(push_step1))set.seed(123123) #so the results are reporoducibledt_q3_impute =mice(dt_q3, m =1, method ='sample') %>%complete()
The model output above merits discussion in the final report.
2.0.2 old notes for Q3 analysis
Previously, I had run an analysis with other factors I thought were relevant but this was not actually what was requested (oops!). The factors were:
push remember step1 (1 = I only remember the form name, 2 = I only remember the score, 3 = I remember the form name and the score, 4 = I don’t remember either) - we decided not to include this variable (can change later if desired)
push score only step 1 (1 = NBME, 2 = Uworld)
push practice test step 1 (1 - 8 listing various exams)
push nbme practice score (from 0 to 100%)
push uw practice score (from 180 to 300)
Listing variables by name and if I have included them:
“push_step1_1” yes
“push_remember_step1_1” not included b/c a precursor question
“push_score_only_step1_1” not currently included but could be
“push_practice_test_step1_1” yes
“push_nbme_practice_score_step1_1” yes
“push_uw_practice_score_step1_1” yes
Code
step1_complete$push_step1 <-ifelse(step1_complete$push_step1 ==2|is.na(step1_complete$push_step1) ==TRUE, 2, 1) #recording the NA's to be "No" (they did not push back step 1)step1_complete$push_step1_label <-factor(step1_complete$push_step1, levels =c(1, 2), labels =c("Yes", "No")) #making a nice descriptive label#did_push_df <- step1_complete %>% filter(push_step1_label == "Yes")#dat %>% filter(!is.na(push_remember_step1_1))step1_complete$push_practice_test_step1 <-factor(step1_complete$push_practice_test_step1, levels =c(1:8), labels =c("NBME 25", "NBME 26", "NBME 27", "NBME 28", "NBME 29", "NBME 30", "UWorld 1", "UWorld 2"))#making labels for descriptive tableunits(step1_complete$push_uw_practice_score_step1) <-"score units"units(step1_complete$push_nbme_practice_score_step1) <-"percent"label(step1_complete$push_practice_test_step1) <-"exam that triggered pushing back"label(step1_complete$push_nbme_practice_score_step1) <-"NBME P(passing) that triggered pushing back"label(step1_complete$push_uw_practice_score_step1) <-"3 digit UWorld score that triggered pushing back"caption <-"Description of indiviuals that pushed back Step 1"table1(~ push_practice_test_step1 + push_uw_practice_score_step1 + push_nbme_practice_score_step1 |push_step1_label, data=step1_complete, topclass="Rtable1-zebra")
Yes (N=20)
No (N=162)
Overall (N=182)
exam that triggered pushing back
NBME 25
0 (0%)
0 (0%)
0 (0%)
NBME 26
0 (0%)
0 (0%)
0 (0%)
NBME 27
1 (5.0%)
0 (0%)
1 (0.5%)
NBME 28
1 (5.0%)
0 (0%)
1 (0.5%)
NBME 29
0 (0%)
0 (0%)
0 (0%)
NBME 30
0 (0%)
0 (0%)
0 (0%)
UWorld 1
0 (0%)
0 (0%)
0 (0%)
UWorld 2
1 (5.0%)
0 (0%)
1 (0.5%)
Missing
17 (85.0%)
162 (100%)
179 (98.4%)
3 digit UWorld score that triggered pushing back (score units)
Mean (SD)
180 (NA)
NA (NA)
180 (NA)
Median [Min, Max]
180 [180, 180]
NA [NA, NA]
180 [180, 180]
Missing
19 (95.0%)
162 (100%)
181 (99.5%)
NBME P(passing) that triggered pushing back (percent)
Mean (SD)
65.6 (18.3)
NA (NA)
65.6 (18.3)
Median [Min, Max]
67.5 [30.0, 87.0]
NA [NA, NA]
67.5 [30.0, 87.0]
Missing
12 (60.0%)
162 (100%)
174 (95.6%)
Descriptive statistics will be reported in total and not by school (as requested by the collaborators).
Looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1. These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I included in the model for step 2 score predictors above and will make a plot in this section.
Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a frequency table.
Code
# Here is how I am planning to organize it:# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2# 2 length of study for step 1 and step 2 (barplot with frequency table)# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")# 4 number of practice tests histogram (among the people who answered the question) for step 1# 5a full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"# 5b full day practice test step 1# 6 histogram of Step 2 scores# 7 frequency table of Step 1 scores# 8 what resources are most widely used barplot# 9 practice score step 2# 10 summarize comments for other_resources, other_step, changes_step# 11 Practice Scores for Step 1# 11a probability of passing for step 1# 11b final_nbme_practice_score_step1 is a percent from 0 to 100# 11c final_uw_practice_score_step1 is a 3 digit UWorld score for step 1#### and these are covered in other plots already (as numbered)# 11b looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.# 7 Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.# 9 These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.############################ now see the plots below as organized by the list I made above# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2qudf <-data.frame(amount =c(step1_complete$uworld_percent_step1, step1_complete$amboss_percent_step1, step2_complete$uworld_percent_step2, step2_complete$amboss_percent_step2), step_exam =c(rep("Step 1", times =nrow(step1_complete)*2), rep("Step 2", times =nrow(step2_complete)*2)), resource_name =c(rep("UWorld", times =nrow(step1_complete)), rep("Amboss", times =nrow(step1_complete)), rep("Uworld", times =nrow(step2_complete)), rep("Amboss", times =nrow(step2_complete))))ggplot(qudf, aes(amount)) +geom_histogram() +facet_wrap(~step_exam + resource_name) +theme_minimal() +scale_color_brewer() +labs(title ="What percent of the UWorld/Amboss question bank did you complete?")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 2 length of study for step 1 and step 2 (barplot with frequency table)# for step 1step1_complete$length_step1_lab <-factor(step1_complete$length_step1 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks")) #make the factor labels which wasn't done beforeggplot(data.frame(step1_complete), aes(x=factor(length_step1_lab))) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 1 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step1_complete$length_step1)), " missing = ", sum(is.na(step1_complete$length_step1))))
Code
# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 2 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$length_step2)), " missing = ", sum(is.na(step2_complete$length_step2))))
Step 2 # of practice tests
Code
# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")ggplot(aes(x = number_of_practice_tests), data= step2_complete) +geom_bar() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +ggtitle("How many total practice tests did you take before Step 2?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$number_of_practice_tests)), " missing = ", sum(is.na(step2_complete$number_of_practice_tests))))
# 4 number of practice tests histogram (among the people who answered the question) for step 1ggplot(aes(x = practicetest_step1), data= step1_complete) +geom_bar() +scale_color_brewer(palette ="Paired") +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +ggtitle("How many total practice tests did you take before Step 1?", subtitle=paste(" answered = ",sum(!is.na(step1_complete$practicetest_step1)), " missing = ", sum(is.na(step1_complete$practicetest_step1))))
## number of practice tests histogram (among the people who answered the question)
Simulating a full test day for Step 2
Code
# 5a full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = full_practice_test), data= step2_complete) +coord_flip() +scale_color_brewer(palette ="Greens") +geom_bar() +theme_minimal() +xlab("") +ylab("Frequency") +ggtitle("Did you simulate a full Step 2 test day experience, e.g. 8 blocks \n of 40 UWorld questions or 2 of the 4-section practice tests?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$full_practice_test)), " missing = ", sum(is.na(step2_complete$full_practice_test))))
Full day practice test for Step 1 (as added by Megan)
Code
# 5b full test day as factor from "Did you simulate a full Step 1 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = simulate_full_practice_1), data= dt_q3) +coord_flip() +scale_color_brewer(palette ="Greens") +geom_bar() +theme_minimal() +xlab("") +ylab("Frequency") +ggtitle("Did you simulate a full Step 1 test day experience, e.g. 8 blocks \n of 40 UWorld questions or 2 of the 4-section practice tests?", subtitle=paste(" answered = ",sum(!is.na(dt_q3$full_practice_test))," missing = ",sum(is.na(dt_q3$full_practice_test))))
Step 2 scores
Code
# 6 histogram of Step 2 scoresggplot(step2_complete, aes(x = score_step2)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Step 2 Score") +ylab("Count") +ggtitle("Frequency of reported Step 2 Scores", subtitle=paste(" answered = ",sum(!is.na(step2_complete$score_step2)), " missing = ", sum(is.na(step2_complete$score_step2)), "since this was an analysis inclusion condition"))
Step 1 scores
Code
# 7 frequency table of Step 1 scoreskable(table(step1_complete$score_step1), caption ="Frequency of passing Step 1") # would be nice to have a way to summarize these in the same table, but this can just be remade with a table in Word
Frequency of passing Step 1
Var1
Freq
1
154
Code
kable(table(is.na(step1_complete$score_step1)), caption ="True indicates missing values for Step 1 score")
True indicates missing values for Step 1 score
Var1
Freq
FALSE
154
TRUE
28
Resources use
Code
# 8 what resources are most widely used barplot resources_2 <- step2_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))resources_1 <- step1_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))rdf <-data.frame(amount =c(t(resources_1), t(resources_2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(c("Uworld", "First Aid", "Anki", "Sketchy", "Amboss", "Pathoma", "Boards and Beyond", "Other"), times =2))res_freq2 <- resources_2/dim(step2_complete)[1]res_freq1 <- resources_1/dim(step1_complete)[1]rdf_freq <-data.frame(amount =c(t(res_freq1), t(res_freq2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(c("Uworld", "First Aid", "Anki", "Sketchy", "Amboss", "Pathoma", "Boards and Beyond", "Other"), times =2))# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization used this as a guide#what resources are most widely used? I want to sort this barplot in order of frequency but for some reason this wasn't working for me with reorder()ggplot(data=rdf, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge()) +coord_flip() +xlab("") +ylab("Count") +scale_fill_brewer(palette="Paired", name ="") +theme_minimal() +ggtitle("Which resources are most widely used by Step 1/2?")
Code
ggplot(data=rdf_freq, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge()) +coord_flip() +xlab("") +ylab("Proportion") +scale_fill_brewer(palette="Paired", name ="") +theme_minimal() +ggtitle("Which resources are most widely used by Step 1/2?")
Code
# 9 practice score step 2 (3 digit score)ggplot(step2_complete, aes(x = practice_score_step2)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 2 Practice Score") +ylab("Count") +ggtitle("What was your 3 digit score on the final practice test \n you took before the Step 2 exam?", subtitle=paste(" answered = ", sum(!is.na(step2_complete$practice_score_step2)), " missing = ", sum(is.na(step2_complete$practice_score_step2))))
NBME and Dr. Legallo's videos and notes from preclinical
1
nbme practice exams
1
NBME practice tests
1
Online MedEd
1
Pixorize
1
Youtube
1
Code
#If you used other resources not listed above, would you use them again?step1_complete$other_hascontents <-as.integer(nchar(step1_complete$other_resources_step1) >0) #flagging the rows that have contents https://stackoverflow.com/questions/64744988/testing-to-see-if-characters-are-present-in-a-cell-in-rmystring1 <- step1_complete %>%filter(other_hascontents ==TRUE) %>%select(other_resources_step1, other_step1) kable(mystring1, caption ="Other resources and if you'd used them again")
Other resources and if you'd used them again
other_resources_step1
other_step1
Pixorize
Yes, helped immensely with biochemistry and immunology
Dirty Medicine YouTube videos
yes
Dirty Medicine
Yes
nbme practice exams
yes
NBME practice tests
Yes
Youtube
Yes
Online MedEd
Maybe
NBME and Dr. Legallo's videos and notes from preclinical
yes, NBME and Dr. Legallo were the most helpful
What would you change?
Code
step1_complete$change_hascontents <-as.integer(nchar(step1_complete$changes_step1) >0) #as.data.frame(step1_complete[which(step1_complete$change_hascontents == 1), "changes_step1" ])mystring2 <- step1_complete %>%filter(change_hascontents ==TRUE) %>%select(changes_step1) %>%toString() cat(str_wrap(mystring2), "\n") #used this package to help it print nicer https://stringr.tidyverse.org/reference/str_wrap.html
c("I would ditch pathoma and just do Anki, UWorld, sketchy and pixorize just
for Biochem and immunology. I would focus my UWorld on things our preclinical
curriculum does not cover well like anatomy, pharmacology, microbiology,
etc. I would focus less on clinical info since we were coming off a year of
clerkships.", "I did it in 3 weeks and it was VERY stressful. However, this
ended up being more than enough time. I would feel reassured that again it is
PASS/FAIL so it will be OK!! ", "Would have started earlier in preclinical -
would have ignored my preclinical classes and used more outside resources to
build foundation. It was difficult to get up to speed for Step 1, and I regret
not studying more in preclinical as I think my Step 2 score would have been
significantly higher. I feel that what was not discussed was how much Step 1
content is still relevant to Step 2.", "I would have studied for 4 weeks instead
of 6 weeks. However, I took my test after thanksgiving so there were less test
dates available. ", "More SketchyMicro. Start using outside resources during
1st year. More UWorld.", "I could have studied less (like 3-4 weeks) because I
was consistently scoring well enough to pass on my practice tests but I was too
nervous to take any chances and decided to instead just use the entire 5 weeks
I had already decided to allocate", "I would make all of my study time part
time (like I did when at home for the first 4 weeks). I wouldn't let classmates
studying 24/7 (unnecessarily in my opinion) stress me out.", "I took about 3
weeks to review Sketchy Micro/Pharm and Pathoma as well as do some Uworld blocks
and practice tests. After scoring so highly on my practice test, I actually
stopped studying except for finishing Sketchy. If I could go back, I would just
incorporate more of Sketchy, Pathoma, and Uworld into pre-dedicated studying
and then just take Step 1 with perhaps 1-2 weeks max of dedicated.", "Focus
more on the things that overlap with step 2", "I just used sketchy/anki for
micro, and that was great. I don't love it for everything, but I think I might
have also tried that combo for some of the biochem. ", "I would have taken the
exam early. I think the school's guidance of waiting until you've gotten 99%
chance of passing on 2 practice tests was probably a bit excessive but without
more information I didn't want to risk not studying as much as possible. ",
"would have done more sketchy, moved through pathoma faster (knew most of it
already from shelf exams/clinical year)", "I would keep a more strict study
schedule in the beginning of my study period.", "Taking a longer period of time
for dedicated. Not done anything else but study during those months ", "Not do
Sketchy path ", "In hindsight, I realize that a more effective approach would
have been to dedicate approximately six weeks to my study plan. During this
time, I would have worked through the entirety of UWorld, Sketchy Micro and
Pharm, as well as Pathoma. Additionally, I would have made it a point to take
every available UWorld and NBME practice test, ensuring that I simulated strict
testing conditions. I should note that my ability to complete only about 25%
of UWorld was somewhat facilitated by my solid foundation from my first year of
medical school and the experience gained from my shelf exams during the PCE.",
"Spent more time studying during pre-clinical years ", "Use less time. 4-5 weeks
is good enough.", "I would have taken Step 1 prior to 3rd year clerkships.",
"I wish it was after pre-clinical year and not after clinical year.", "Since I
took step 1 after clerkships, I did not have to study much. I used it as a prep
time for basic sciences of step 2. I would not have bought World for step 1 as
I would have been fine without it", "I wish we had completed step 1 post pre
clerkship with dedicated time. ", "I would focus on pathophysiology section of
uworld", "It's really not that hard to pass. Just knowing First Aid and going
through UWorld is sufficient.", "I would have started reviewing material (i.e.
watching B&B, Sketchy pharm) before dedicated started so that I could devote
my dedicated time almost exclusively to UWorld questions", "Though I probably
could have studied much less, I did not find the \"% likelihood of passing\"
metric very reassuring and still ended up studying a lot of make sure it was
as reassuring as possible.", "I would probably study really hard for 2 weeks
and take it at the end of the 2 weeks", "I would probably take the test after 2
weeks instead of 3", "I would have taken it immediately after ending clerkships
and then gone off to enjoy my summer. I think many students would be more than
capable of taking the exam following first year before the start of clerkships,
and in many ways I think the increased proximity to many of the basic science
concepts tested on Step 1 would have been a major advantage. Some of the
studying felt like taking a step backwards instead of progressing to try to
advance the depth of my clinical knowledge going into the immersion phase.", "I
ended up taking 12 weeks. M1 was terrible for me and I didn't know how to study.
So, I would go back and study more for it during M1 or M2 but even 12 weeks was
pushing it for me to make up the deficit I was in ", "I took a practice test on
Day 1 of studying, which I did not find helpful. I did poorly on Day 1 and my
score improved by 50 points over 2.5 weeks, so the Day 1 baseline was useless.
I would have studied for slightly longer to feel more confident and take one
more practice exam. I took 1 practice exam pre-studying and 2 after studying. I
passed 1 exam 8 days prior to the test. My study period was 23 days, with days
off during that, and I think near 28 days would have been ideal to feel less
rushed and taken an additional exam. I do not have a great memory, so I felt
like I needed more time, but everyone's different.", "Much more uworld, less
sketchy and anki", "None", "If anything, maybe one less week. It can definitely
be done in 4 weeks but the extra time gave me a comfortable buffer", "I would
have a shorter study period and focused more on step 2", "I would have switched
to timed, untutored mode on U world sooner. On the real exam, my confidence was
so shaky because I was used to the immediate gratification of tutor mode and
being able to take my time to think through the harder questions.", "I wish I
would have started the premade anki deck during M1 instead of at the start of
M2. I finished about 50% of STEP 1 cards and 75% of STEP 2 cards by the time I
took Step 1 and I think it would have been much quicker to a passing score if
i had a higher percentage of STEP 1 cards mastered. I did not use UWorld at all
because I found that reviewing missed questions didn't help me at all. I never
remembered the info or explanations unless I memorized it with anki first. Once
I had things memorized, I had no issues applying the knowledge in exam form.
The only questions I missed were ones that I had never seen on anki.", "Just do
UWorld and know First Aid well, that's all you really need.", "None. ")
Practice Scores for Step 1 and Step 2
Code
# 11 Practice Scores for Step 1 and Step 2# 11a final_nbme_practice_score_step1 is a percent from 0 to 100ggplot(step1_complete, aes(x = final_nbme_practice_score_step1)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 1 Probability of Passing") +ylab("Count") +ggtitle("What was the probability of passing you received \n on your final NBME form before the Step 1 exam?", subtitle=paste(" answered = ", sum(!is.na(step1_complete$final_nbme_practice_score_step1)), " missing = ", sum(is.na(step1_complete$final_nbme_practice_score_step1))))
# 11b final_uw_practice_score_step1 is a 3 digit UWorld score for step 1ggplot(step1_complete, aes(x = final_uw_practice_score_step1)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 1 UWorld Score") +ylab("Count") +ggtitle("What was the 3 digit score you received on your final \n UWorld form before the Step 1 exam?", subtitle=paste(" answered = ", sum(!is.na(step1_complete$final_uw_practice_score_step1)), " missing = ", sum(is.na(step1_complete$final_uw_practice_score_step1))))
All the analyses are performed using the following: - R version 4.2.2 (2022-06-24); R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
---title: "USMLE Data Analysis"author: "Lisa, Jeffrey (q3)"date: "`r format(Sys.time(), '%B %d, %Y')`"format: html: theme: flatly code-fold: true code-tools: true html-math-method: katex toc: true toc-depth: 3 fig-width: 7 fig-height: 5 toc-title: "Contents" number-sections: true self-contained: true self-contained-math: true smooth-scroll: true fontsize: 0.8em title-block-banner: true citation-location: margin include-after-body: graph_fold.htmleditor: visualengine: knitr---# Data import and cleaningIn this section I import the data and prepare it for analysis.```{r setup}#| warning: false#| message: false#| echo: fencedlibrary(tidyverse)library(plotly)library(dplyr)library(tidyr)library(knitr)library(table1) #Create HTML Tables of Descriptive Statistics https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html#library(OMTM1) #https://github.com/schildjs/OMTM1/library(Hmisc)library(rms) # Regression Modeling Strategies by Frank https://cran.r-project.org/web/packages/rms/index.htmllibrary(modelsummary) #Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready https://cran.r-project.org/web/packages/modelsummary/index.htmllibrary(scales) # The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. https://scales.r-lib.orglibrary(viridis) #colorslibrary(cowplot) #allows me to use plotgridlibrary(gridExtra) #adding tables to plotslibrary(visdat) #shows missing datalibrary(GGally) #makes pairs plotslibrary(sandwich) #for robust standard errorslibrary(gtsummary) #makes the tables look nicerlibrary(glmnet) #for running LASSOlibrary(mice)#if you want to rerun this in your computer, change the directory to where you want to work on your machine locally (do not upload data to github!)setwd("/Users/lisalevoir/BIOS7351_Collab/data_project2") #this line I would need to run in the consoleknitr::opts_knit$set(root.dir ="/Users/lisalevoir/BIOS7351_Collab/github/BIOS_Collaboration/project_2_analysis") #now I set global options for knitting, I also had to toggle global options > R Markdown > evaluate chunks in current directory#THIS IS WHERE YOU'D CHANGE THE FILE PATHS TO WHERE YOU STORED THE DATA LOCALLY#import the datadat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/combined_data_203.csv")#to compare that the merging went as expectedVUMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_VUMS.csv")HMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_HMS.csv")UVAdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_UVA.csv")```### Inclusion/ExclusionCriteria to exclude students who most likely took a scored exam:- Any PhD students (n = 2)- Any 5th year program students (n = 19)- M4 students at Vanderbilt (n = 5)- Students who did not complete either Step survey (n = 2)- Students who specifically stated they took a scored Step 1 (n=1)```{r data_cleaning}#| warning: false#| message: false#| echo: false#creating a unique ID for each participant so I can easily summarize how many people have been included/excluded at each stepdat <- dat %>%mutate(uniqueID =paste(school, record_id))VUMSdat <- VUMSdat %>%mutate(uniqueID =paste("VUSM", record_id))HMSdat <- HMSdat %>%mutate(uniqueID =paste("HMS", record_id))UVAdat <- UVAdat %>%mutate(uniqueID =paste("UVASOM", record_id))## compare the individuals that were excluded by the collaborator (in dat) vs. the full datasetall_survey_ids <-c(VUMSdat$uniqueID, HMSdat$uniqueID, UVAdat$uniqueID)collaborator_survey_ids <- dat %>%select(uniqueID) %>%pull()``````{r}#| warning: false#| message: false#| echo: false#the list of IDs we decided as a group to excludeexcludeVU <-c(23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64)excludeHMS <-c(1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61)excludeUVA <-c(33, 47, 80, 81, 83)'%!in%'<-function(x,y)!('%in%'(x,y)) #make a way to use the not in commandVU_in <-filter(VUMSdat, record_id %!in% excludeVU)H_in <-filter(HMSdat, record_id %!in% excludeHMS)UVA_in <-filter(UVAdat, record_id %!in% excludeUVA)#now I want to select the columns I'd like to include for all of my analysis (so they're in the proper order for a cbind). this will be relatively easy to come back to edit later, if needed. # I am also including a school identifier as a common column before I rbind all 3 school data frames togetherVU_in[,"schoolid"] <-"VU"UVA_in[,"schoolid"] <-"UVa"H_in[,"schoolid"] <-"HMS"############ plan for how I will get the data in a format I want:# - pull relevant columns by "starts with"# - confirm all column names match, then# - rbind together once# - then I can select from this sheet the questions relevant to Step 1 first with ends with "_1", and those who took Step 1 second with a "_2"############ Now pulling the common columns we're interested in as predictors and outcomes##note, for VU I removed ""number_other_courses_step1_1" and starts_with("other_courses_step1_1___1) because these questions were not on the other school surveystook_step1_VU <- VU_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"), # split into binary: Yes and I am glad I did, Yes and it was unnecessary, No and I wish I did, No and I am glad I did notstarts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" ) took_step1_UVA <- UVA_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"),starts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" )took_step1_H <- H_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"),starts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" )## now rbinding the three schools togethertook_step1general <-rbind(took_step1_H, took_step1_UVA, took_step1_VU)## splitting the dataset so I also have reference sheets specific to step 1 first and step 1 secondtook_step1_first <- took_step1general %>%select(ends_with("_1"), "schoolid") #I actually haven't used this data frame in the analysistook_step1_second <- took_step1general %>%select(ends_with("_2"), "schoolid") #I actually haven't used this data frame in the analysis#this function will accommodate for the survey which split up the step 2 responses by whether it was taken first or second. I would like to run analysis (and gather percent missing) across all scores as they are available. I can add an indicator column later to identify people who took it first/second. The idea with this function is the output (called storage_df) should be easy to colbind onto the original data frame. Voila!#also if we decide to change our minds and include more variables it will be quick to run.merge_my_columns <-function(input_cols, source_df){storage_df <-as.data.frame(matrix(nrow =nrow(source_df), ncol =length(input_cols)))names(storage_df) <- input_colsfor(i in1:length(input_cols)) { cols <- source_df %>%select(starts_with(input_cols[i])) %>%names()print(cols) new_cobined_col <-coalesce(source_df[, cols[1]], source_df[, cols[2]]) storage_df[, i] <- new_cobined_col}print("Above is a list of columns I have combined for you. Hope it looks right!")return(storage_df)}######## combine results across exam order for Step 2class(took_step1general$practicetest_step1_1) <-"integer"#had to change class in order to coalesce these. the original question was "How many total practice tests did you take before Step 1?" and the response field was text but it should just be integers. ## This will create a warning in the results that NAs are introduced by coercion but that is alrightclass(took_step1general$practicetest_step1_2) <-"integer"class(took_step1general$push_uw_practice_score_step1_2) <-"integer"class(took_step1general$push_uw_practice_score_step1_1) <-"integer"cols_step1 <-c("uworld_percent_step1", "amboss_percent_step1", "length_step1", "practicetest_step1","full_test_practice_step1", "push_step1", "push_practice_test_step1", "push_nbme_practice_score_step1", "push_uw_practice_score_step1" ,"final_nbme_practice_score_step1", "final_uw_practice_score_step1", "score_step1", "other_resources_step1", "other_step1","changes_step1")#now I need to handle the resources question, which was handled differentlytookstep1_resources <- took_step1general %>%mutate(I.uworld =coalesce(resources_step1_1___1 + resources_step1_2___1), I.firstaid =coalesce(resources_step1_1___2 + resources_step1_2___2),I.anki =coalesce(resources_step1_1___3 + resources_step1_2___3), I.sketchy =coalesce(resources_step1_1___4 + resources_step1_2___4), I.amboss=coalesce(resources_step1_1___5 + resources_step1_2___5), I.pathoma =coalesce(resources_step1_1___6 + resources_step1_2___6),I.boardsbeyond =coalesce(resources_step1_1___7 + resources_step1_2___7),I.other =coalesce(resources_step1_1___8 + resources_step1_2___8)) %>%select(I.uworld, I.firstaid, I.anki, I.sketchy, I.amboss, I.pathoma, I.boardsbeyond, I.other)to_add <-merge_my_columns(input_cols = cols_step1, source_df = took_step1general)#table(to_add$score_step1) ## all step 1 scores have an outcome so there is no one to drop/filter outstep1_complete <-bind_cols(took_step1general, to_add, tookstep1_resources) %>%select(record_id, uniqueID:I.other)``````{r data_cleaning_forstep2}#| warning: false#| message: false#| echo: falsetook_step2_VU <- VU_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" ) #note, I removed ""number_other_courses_step1_1" and starts_with("other_courses_step1_1___1) because these questions were not on the other school surveystook_step2_UVA <- UVA_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" )took_step2_H <- H_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" )took_step2general <-rbind(took_step2_H, took_step2_UVA, took_step2_VU)took_step2_first <- took_step2general %>%select(ends_with("_1"), "schoolid", "uniqueID") #don't plan on using these, but available if neededtook_step2_second <- took_step2general %>%select(ends_with("_2"), "schoolid", "uniqueID")#now I need to handle the resources question, which was handled differentlytookstep2_resources <- took_step2general %>%mutate(I.uworld =coalesce(resources_step2_1___1 + resources_step2_2___1), I.firstaid =coalesce(resources_step2_1___2 + resources_step2_2___2),I.anki =coalesce(resources_step2_1___3 + resources_step2_2___3), I.sketchy =coalesce(resources_step2_1___4 + resources_step2_2___4), I.amboss=coalesce(resources_step2_1___5 + resources_step2_2___5), I.pathoma =coalesce(resources_step2_1___6 + resources_step2_2___6),I.boardsbeyond =coalesce(resources_step2_1___7 + resources_step2_2___7),I.other =coalesce(resources_step2_1___8 + resources_step2_2___8)) %>%select(I.uworld, I.firstaid, I.anki, I.sketchy, I.amboss, I.pathoma, I.boardsbeyond, I.other)########combine results across exam order for Step 2class(took_step2general$practice_score_step2_2) <-"integer"#had to change class in order to coalesce theseclass(took_step2general$score_step2_1) <-"integer"class(took_step2general$score_step2_2) <-"integer"columns_to_summarize <-c("uworld_percent_step2", "amboss_percent_step2", "length_step2", "practicetest_step2","full_test_practice_step2", "practice_score_step2", "practice_test_step2", "score_step2","target_score_step2")to_add <-merge_my_columns(input_cols = columns_to_summarize, source_df = took_step2general)#listing the people with no outcome reporteddrop_these_with_no_outcome <-bind_cols(took_step2general, to_add) %>%filter(is.na(score_step2)) %>%select("record_id", "schoolid", "uniqueID")#dropping the people with no outcomeintermediate_step2 <-bind_cols(took_step2general, to_add, tookstep2_resources) %>%filter(!is.na(score_step2))#creating the complete data setstep2_complete <- intermediate_step2 %>%select(record_id, schoolid:I.other)```There were `r length(intersect(all_survey_ids, collaborator_survey_ids))` participants after the the clients did data exclusion by hand, as compared to the total survey size which was `r length(step1_complete$uniqueID)` unique individuals included in the step 1 data.These are the record IDs that were excluded by the collaborator: `r setdiff(all_survey_ids, collaborator_survey_ids)`.For our analysis, I originally included everyone from the 3 school specific data sets. Then, we flagged individuals to remove and I remade the source data set. These are the records IDs that we decided to exclude:- VU record ids: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64- HMS record ids: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61- UVA record ids: 33, 47, 80, 81, 83Below are tables for Step 1 and Step 2 response inclusion by school:```{r}kable(table(step1_complete$schoolid), caption ="Number of Step 1 responses included by school")kable(table(step2_complete$schoolid), caption ="Number of Step 2 responses included by school")```Notice that unfortunately for Step 2 analysis, we had to drop `r nrow(drop_these_with_no_outcome)` individuals who did not report a step 2 score. These raw counts and frequencies of people who did not give a step 2 score (and are therefor not eligible for analysis) are listed by institution below.```{r}# describing the missingnesskable(table(drop_these_with_no_outcome$schoolid), caption ="Number of missing Step 2 scores by institution")num <-as.vector(table(drop_these_with_no_outcome$schoolid))denom <-as.vector(table(step2_complete$schoolid))nonresponse_freq <-setNames(c(round(num/denom, 3)), c(names(table(step2_complete$schoolid))))kable(nonresponse_freq, caption ="frequency of missing step 2 scores by instition")```We also find it helpful to visually profile the missingness in this graphic below. Some of this missingness is due to questions being hierarchical.```{r}######### visually profile missing responsesp1 <- visdat::vis_miss(step1_complete)p2 <- visdat::vis_miss(step2_complete)title_gg <-ggdraw() +draw_label("Response missingness for pooled survey results across exam order")gridded <-plot_grid(p1, p2, label_size =12, ncol =2, align ="hv", label_x =0.5,labels =c("Step 1","Step 2"))plot_grid(title_gg, gridded, nrow =2, rel_heights =c(0.1, 0.9))```# AnalysisNote, we plan to use the robust/sandwich variance estimator for regression models. One inclusion criteria is that the outcome variable ("Y") must be available for a subject to be included in the analysis question (ie. if they did not report a step 2 score, we won't perform relevant step 2 analysis on them).::: panel-tabset## Q1 Does order have an impact on Step 2 scores?We cannot analyze Step 1 since all survey responses reported passing For Step 2 scores, I will perform a linear regression with:- Y = Step 2 score- X = factor ( 1 = "step 1 first", 2 = "step 2 first", 3 = "only step 1", 4 = "only step 2")- Z = school (need to adjust for this)```{r exam_order_analysis}step2_complete[ ,"order_factor"] <-factor(step2_complete$exam_order, levels =c(1,2, 3, 4), labels =c("step 1 first", "step 2 first", "only step 1", "only step 2"))step2_complete[ ,"which_exam_first"] <-ifelse(step2_complete$order_factor =="step 1 first"| step2_complete$order_factor =="only step 1", 1, 2)step2_complete[ ,"school_factor"] <-factor(step2_complete$schoolid, labels =c("HMS", "UVA", "VUMS"))table(step2_complete$order_factor)table(step2_complete$which_exam_first)exam_order_mod <-lm(formula = score_step2 ~ which_exam_first + school_factor, data = step2_complete)summary(exam_order_mod) #use this to get the betas, but get the standard error and p values from the sandwich commandsqrt(diag(sandwich(exam_order_mod))) #this gives the sandwich variance now set to be the standard errorp_interim <-pt(abs(exam_order_mod$coefficients)/sqrt(diag(sandwich(exam_order_mod))) , df =134) options(scipen =999)1- p_interim #here are the p values with using the robust se```Based on the model output (p values), there doesn't appear to be any signifigant associations between exam order and the score for Step 2. Since the $R^2$ value is essentially 0, I conclude that there was no effect of exam order on step 2 scores.## Q2 background information for analysis thought process of "What factors affect Step 2 score?"Again, we cannot analyze Step 1 scores since all respondents reported passing.Based on our SAP, if there are any covariates with more than 30% of responses missing, we will drop that variable or populate it with 0, depending on context. For example, the percent of Amboss questions completed will be filled with 0 for people who didn't answer that they used Amboss since it seems safe to assume they didn't complete any of the Amboss questions. For people who did select that they used Amboss but didn't answer a percentage will be treated as missingIf less than 30% are missing, I may consider performing bootstrap sampling of known values to replace missing values. After accounting for missingness, I will assess for co-linearity of the predictors (ie. correlation) using VIF. If there is high co-linearity, we will use LASSO to perform variable selection. If there is no evidence of concerning levels of colinearity, I will proceed with linear regression.What actually ended up happening is that I did imputation using the mice package to account for missing data, then tried running a model but it had too high of VIF values for the length of time studying since this was a factor variable and there were \<5 observations for multiple categories. Due to this sparsity of data in some categoties, it was not as advisable to include this predictor as a categorical variable (see bar plot) so that's why we decided to change this variable to be continuous. Then there was no longer an issue with VIF values and therefore no need to run LASSO. While changing a predictor to be continous does impose an additional assumption about the linear relationship between amount of time studying, we felt it was a reasonable assumption to impose since it allows us to have one summary coefficient and allows us to "borrow information" across levels of the data.I will keep the previous code here to show thought process and you're welcome to uncomment/run it if you'd like to see the other models, but they are not what we're using in the final report for the reasons listed above.```{r data_profile_and_pairs_plots}#| warning: false#here I am creating the cleaned Uworld and Amboss variables. if they did not check that they used it, the percentage is set to 0.# if they did use it then the next ifelse checks if there is an NA where there should be a percent# if there is an NA where there should be a percent, we maintain this NA. If there is not, then we can report the percentage.# same process for uworldstep2_complete <- step2_complete %>%mutate(amboss_clean =ifelse(I.amboss ==0, 0, ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, NA, step2_complete$amboss_percent_step2)),uworld_clean =ifelse(I.uworld ==0, 0, ifelse(is.na(step2_complete$uworld_percent_step2 ) ==TRUE, NA, step2_complete$uworld_percent_step2)))#inspecting percent missing, it seems like most responses are now complete enough for analysisstep2_complete$amboss_percent_step2 <-ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, 0, step2_complete$amboss_percent_step2)class(step2_complete$practicetest_step2) <-"integer"step2_complete[,"on_target"] <-factor(step2_complete$target_score_step2, levels =c(1,2,3), labels =c("at target", "above target", "below target"))# I looked into the difference between practice_test_step2 (the final practice test I took before my exam was... 8 options) and practicetest_step2 (text response of how many practice tests did you take before step 2).# I decided to use practicetest_step2 (text response of how many practice tests did you take before step 2) but extract and recode it as numeric belowstep2_complete[, "practice_test_2_clean"] <-ifelse(is.na(step2_complete$practicetest_step2) ==TRUE, NA, substr(step2_complete$practicetest_step2, start =1, stop =1))step2_complete[, "number_of_practice_tests"] <-as.numeric(step2_complete$practice_test_2_clean) #I want to include the number of practice tests as just one numeric covariate, not coded as a factor like it is in practice_test_2_cleanstep2_complete$full_practice_test <-factor(step2_complete$full_test_practice_step2 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step2_complete$simulate_full_practice <-ifelse(step2_complete$full_test_practice_step2 <=2, 1, 0)step2_complete$simulate_full_practice <-factor(step2_complete$simulate_full_practice, levels =c(0, 1), labels =c("No", "Yes"))step2_complete$length_step2 <-factor(step2_complete$length_step2 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))######## profile missingness in the step 2 data and addressstep2_profile <- step2_complete %>%relocate(c(score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor), .before = schoolid)percents_missing <-round(colSums(is.na(step2_profile))/nrow(step2_profile), 3)*100kable(percents_missing, caption ="Percent missing observations for pooled Step 2 survey")```Multiple linear regression with:- Y = Step 2 score (from "What was your 3-digit score on the official Step 2 exam?")- X1 = % UWorld (recoded from "Approximately what % of UWorld did you complete during your protected study period for step 2?")- X2 = % Amboss (recoded from "Approximately what % of the Amboss question bank did you complete during your Step 2 protected study period?")- X3 = length study (factor from "How long did you study for Step 2 during a protected study period?")- X4 = number of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")- X5 = full test day (yes/no code as binary from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?")- X6 = final practice score (What was your 3 digit score on the final practice test you took before the Step 2 exam?)- Z = School (need to adjust for this in order to remove any influence)Careful to note that not all cases are complete - for example there are 399 responses in the complete step 2 dataset, of which for the number of practice tests taken, 108 are missing and 291 have a response recorded.Below I report the model results, sandwich variance, and VIF for 3 different versions of the step 2 scores model. The first is essentially a complete case analysis, then I run a model on imputed data, and then I run LASSO to select predictors on the imputed data and use the LASSO selected predictors to create the 3rd model. After that, I show descriptive statistics for predictors and the outcome that was included in the model.```{r}# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 2 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$length_step2)), " missing = ", sum(is.na(step2_complete$length_step2))))```This plot shows that there is an issue with the number of participants who reported 1-2 weeks or more than 8 weeks so it would be advantageous to "borrow information" across levels by coding length of study as a continous variable, rather than as a categorical factor.### DO NOT USE THIS PART, THIS WAS A PART OF OUR ANALYSIS THOUGTHS BUT NOT WHAT WE ENDED UP GOING WITH DUE TO DATA AVAILABILITY```{r}mod_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_complete) #summary(mod_step2_scores) # sqrt(diag(sandwich(mod_step2_scores))) #this gives the sandwich variance now set to be the standard error# p_interim <-pt(abs(mod_step2_scores$coefficients)/sqrt(diag(sandwich(mod_step2_scores))) , df = 72) # 1 - p_interim #here are the p values with using the robust se# vif(mod_step2_scores)```Now showing the same the model with imputed missing values.```{r mod2_with_imputation}## model variables were uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor# with missingness in uworld_clean, amboss_clean, length_step2, practice_score_step2, and number_of_practice_testsstep2_impute <- step2_complete %>%select( score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor )set.seed(123123) #so the results are reporoduciblestep2_impute = mice::mice(step2_impute,m=1,method='sample') %>% complete #better way to impute according to Jeffrey#this is the number of missingnesskable(colSums(is.na(step2_complete[,c("uworld_clean", "amboss_clean", "length_step2", "simulate_full_practice", "practice_score_step2", "number_of_practice_tests", "school_factor")])), caption ="Missingness before imputation")```Missingness after imputation:```{r outdated_model2_with_imputation}colSums(is.na(step2_impute)) #missingness after imputation - should all be 0's# here is are two ways to confirm the change was as expected (uncomment to see them in your console):# cbind(step2_impute$length_step2, step2_complete$length_step2)# cbind(step2_impute$practice_score_step2, step2_complete$practice_score_step2)mod_impute_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_impute) #summary(mod_impute_step2_scores)#calculating the p values using robust standard error:sqrt(diag(sandwich(mod_impute_step2_scores))) #this gives the sandwich variance now set to be the standard errorp_interim <-pt(abs(mod_impute_step2_scores$coefficients)/sqrt(diag(sandwich(mod_impute_step2_scores))) , df =114) 1- p_interim #here are the p values with using the robust se#here this shows there's and issue with the new VIF because of colinearityvif(mod_impute_step2_scores)#mod_impute_step2_scores %>% tbl_regression() #this makes a cleaner looking summary table```Here is where I run the LASSO (WHICH WE ARE NOT INCLUDING IN THE FINAL REPORT):```{r lasso_model2}## Lasso which we are not using anymoreset.seed(123123)my_design_mat <-model.matrix(score_step2~uworld_clean+amboss_clean+as.factor(length_step2)+as.factor(simulate_full_practice)+practice_score_step2+number_of_practice_tests+as.factor(school_factor),step2_impute)cv.glmnet(my_design_mat[,-1],step2_impute$score_step2,family ='gaussian',standardize = T,type.measure ='mse', lambda =seq(1e-2,1e+1,len=1e+4)) # based on this output, we can build the model based on 0.262model =glmnet(my_design_mat[,-1],step2_impute$score_step2,family ='gaussian',standardize = T,lambda =0.262)coef(model)#now I need to make a model with a more limited set of predictors (this will stay the same since we set the seed)#recode length step 2 to have indicatorsstep2_LASSO <- step2_impute %>%mutate(study1.2.weeks =case_when(length_step2 =="1-2 weeks"~1, TRUE~0), study5.6.weeks =case_when(length_step2 =="5-6 weeks"~1, TRUE~0), study7.8.weeks =case_when(length_step2 =="7-8 weeks"~1, TRUE~0), study8.plus.weeks =case_when(length_step2 =="more than 8 weeks"~1, TRUE~0))## here is where I would run a model with the selected predictorsmod_LASSO_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + study1.2.weeks + study5.6.weeks + study7.8.weeks + study8.plus.weeks + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_LASSO) #careful, now the reference group absorbed in the intercept contains both people who have studied for less than 1 week AND people who have studied for 3-4 weeks## then make a summary of that model's output and the new robust se and p values# summary(mod_LASSO_step2_scores)# vif(mod_LASSO_step2_scores)# # #here are the p values# sqrt(diag(sandwich(mod_LASSO_step2_scores))) #this gives the sandwich variance now set to be the standard error# p_interim <-pt(abs(mod_LASSO_step2_scores$coefficients)/sqrt(diag(sandwich(mod_LASSO_step2_scores))) , df = 72) # 1 - p_interim #here are the p values with using the robust se```## Q2 Report these for the model for step 2 scoresThis is using the imputed values from the step 2 data```{r}#modify step2_impute dataset to modify length step 2 to be middle ranges of the values * note I am using the step2_impute dataset which is from the mice packagestep2_impute <- step2_impute %>%mutate(numeric_length_step2 =case_when(length_step2 =="less than 1 week"~0.5, length_step2 =="1-2 weeks"~1.5, length_step2 =="3-4 weeks"~3.5, length_step2 =="5-6 weeks"~5.5, length_step2 =="7-8 weeks"~7.5, length_step2 =="more than 8 weeks"~8.5))mod2_cont <-lm(score_step2 ~ uworld_clean + amboss_clean + numeric_length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_impute) summary(mod2_cont) #use this to get the betas, but get the standard error and p values from the sandwich commandvif(mod2_cont)sqrt(diag(sandwich(mod2_cont))) #this gives the sandwich variance now set to be the standard errorp_interim <-pt(abs(mod2_cont$coefficients)/sqrt(diag(sandwich(mod2_cont))) , df =72) options(scipen =999) # makes the p values easier to read1- p_interim #here are the p values with using the robust se```Exporting the data to post on Sharepoint (careful, right now this will load them on github if you uncomment these lines). Nick can use these to make the data dictionary.```{r}# write.csv(step2_impute, file = "step2_impute_export.csv")# write.csv(step2_complete , file = "step2_complete_export.csv")# write.csv(step1_complete, file = "step1_complete_export.csv")# write.csv(step2_LASSO , file = "step1_LASSO_export.csv")```## Q3 What is associated (in this data) with pushing back a Step 1 exam date?Here, we perform logistic regression withY = yes or no (1 = yes, 2 = no for "push_step1")There may not be sufficient data on this since only 20 people responded that they decided to push back Step 1. ```{r}#| warning: false#| message: falsestep1_complete$full_practice_test <-factor(step1_complete$full_test_practice_step1 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step1_complete$simulate_full_practice_1 <-ifelse(step1_complete$full_test_practice_step1 <=2, 1, 0)step1_complete$simulate_full_practice_1 <-factor(step1_complete$simulate_full_practice_1, levels =c(0, 1), labels =c("No", "Yes"))dt_q3 = step1_complete %>%mutate(amboss_clean =if_else(I.amboss>0,amboss_percent_step1,0),uworld_clean =if_else(I.uworld>0,uworld_percent_step1,0),numeric_length_step1 =2*(length_step1-1)-0.5,push_step1 = push_step1-1) %>%left_join(step2_complete %>%select(record_id,schoolid, practice_score_step2)) %>%filter(!is.na(push_step1))set.seed(123123) #so the results are reporoducibledt_q3_impute =mice(dt_q3, m =1, method ='sample') %>%complete()model =glm( push_step1~ uworld_clean + amboss_clean + numeric_length_step1 + simulate_full_practice_1 + practicetest_step1 + schoolid, dt_q3_impute,family ='binomial') #logistic regressionwith(summary(model), 1- deviance/null.deviance) # to calculate R^2model_tb = broom::tidy(model) #make a nice displaymodel_tb$se =sqrt(diag(sandwich(model)))model_tb$p.value =pnorm(-abs(model$coefficients/model_tb$se))model_tb$odd_ratio =exp(model$coefficients)model_tb #print the output table```The model output above merits discussion in the final report.### old notes for Q3 analysisPreviously, I had run an analysis with other factors I thought were relevant but this was not actually what was requested (oops!).The factors were:- push remember step1 (1 = I only remember the form name, 2 = I only remember the score, 3 = I remember the form name and the score, 4 = I don't remember either) - we decided not to include this variable (can change later if desired)- push score only step 1 (1 = NBME, 2 = Uworld)- push practice test step 1 (1 - 8 listing various exams)- push nbme practice score (from 0 to 100%)- push uw practice score (from 180 to 300)Listing variables by name and if I have included them:- "push_step1_1" yes- "push_remember_step1_1" not included b/c a precursor question- "push_score_only_step1_1" not currently included but could be- "push_practice_test_step1_1" yes- "push_nbme_practice_score_step1_1" yes- "push_uw_practice_score_step1_1" yes```{r}step1_complete$push_step1 <-ifelse(step1_complete$push_step1 ==2|is.na(step1_complete$push_step1) ==TRUE, 2, 1) #recording the NA's to be "No" (they did not push back step 1)step1_complete$push_step1_label <-factor(step1_complete$push_step1, levels =c(1, 2), labels =c("Yes", "No")) #making a nice descriptive label#did_push_df <- step1_complete %>% filter(push_step1_label == "Yes")#dat %>% filter(!is.na(push_remember_step1_1))step1_complete$push_practice_test_step1 <-factor(step1_complete$push_practice_test_step1, levels =c(1:8), labels =c("NBME 25", "NBME 26", "NBME 27", "NBME 28", "NBME 29", "NBME 30", "UWorld 1", "UWorld 2"))#making labels for descriptive tableunits(step1_complete$push_uw_practice_score_step1) <-"score units"units(step1_complete$push_nbme_practice_score_step1) <-"percent"label(step1_complete$push_practice_test_step1) <-"exam that triggered pushing back"label(step1_complete$push_nbme_practice_score_step1) <-"NBME P(passing) that triggered pushing back"label(step1_complete$push_uw_practice_score_step1) <-"3 digit UWorld score that triggered pushing back"caption <-"Description of indiviuals that pushed back Step 1"table1(~ push_practice_test_step1 + push_uw_practice_score_step1 + push_nbme_practice_score_step1 |push_step1_label, data=step1_complete, topclass="Rtable1-zebra")```## Descriptive StatisticsDescriptive statistics will be reported in total and not by school (as requested by the collaborators).Looking at the data dictionary, we are given final_nbme_practice \_score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1. These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I included in the model for step 2 score predictors above and will make a plot in this section.Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a frequency table.```{r}# Here is how I am planning to organize it:# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2# 2 length of study for step 1 and step 2 (barplot with frequency table)# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")# 4 number of practice tests histogram (among the people who answered the question) for step 1# 5a full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"# 5b full day practice test step 1# 6 histogram of Step 2 scores# 7 frequency table of Step 1 scores# 8 what resources are most widely used barplot# 9 practice score step 2# 10 summarize comments for other_resources, other_step, changes_step# 11 Practice Scores for Step 1# 11a probability of passing for step 1# 11b final_nbme_practice_score_step1 is a percent from 0 to 100# 11c final_uw_practice_score_step1 is a 3 digit UWorld score for step 1#### and these are covered in other plots already (as numbered)# 11b looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.# 7 Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.# 9 These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.############################ now see the plots below as organized by the list I made above# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2qudf <-data.frame(amount =c(step1_complete$uworld_percent_step1, step1_complete$amboss_percent_step1, step2_complete$uworld_percent_step2, step2_complete$amboss_percent_step2), step_exam =c(rep("Step 1", times =nrow(step1_complete)*2), rep("Step 2", times =nrow(step2_complete)*2)), resource_name =c(rep("UWorld", times =nrow(step1_complete)), rep("Amboss", times =nrow(step1_complete)), rep("Uworld", times =nrow(step2_complete)), rep("Amboss", times =nrow(step2_complete))))ggplot(qudf, aes(amount)) +geom_histogram() +facet_wrap(~step_exam + resource_name) +theme_minimal() +scale_color_brewer() +labs(title ="What percent of the UWorld/Amboss question bank did you complete?")```Length of study for each exam```{r}# 2 length of study for step 1 and step 2 (barplot with frequency table)# for step 1step1_complete$length_step1_lab <-factor(step1_complete$length_step1 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks")) #make the factor labels which wasn't done beforeggplot(data.frame(step1_complete), aes(x=factor(length_step1_lab))) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 1 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step1_complete$length_step1)), " missing = ", sum(is.na(step1_complete$length_step1))))# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +geom_text(aes(label = ..count..), stat ="count", vjust =1.5, colour ="white") +xlab("Time") +ylab("Frequency") +ggtitle("How long did you study for Step 2 during a protected study period?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$length_step2)), " missing = ", sum(is.na(step2_complete$length_step2))))```Step 2 \# of practice tests```{r}# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")ggplot(aes(x = number_of_practice_tests), data= step2_complete) +geom_bar() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +ggtitle("How many total practice tests did you take before Step 2?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$number_of_practice_tests)), " missing = ", sum(is.na(step2_complete$number_of_practice_tests))))```Step 1 \# of practice tests```{r}# 4 number of practice tests histogram (among the people who answered the question) for step 1ggplot(aes(x = practicetest_step1), data= step1_complete) +geom_bar() +scale_color_brewer(palette ="Paired") +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +ggtitle("How many total practice tests did you take before Step 1?", subtitle=paste(" answered = ",sum(!is.na(step1_complete$practicetest_step1)), " missing = ", sum(is.na(step1_complete$practicetest_step1))))## number of practice tests histogram (among the people who answered the question)```Simulating a full test day for Step 2```{r}# 5a full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = full_practice_test), data= step2_complete) +coord_flip() +scale_color_brewer(palette ="Greens") +geom_bar() +theme_minimal() +xlab("") +ylab("Frequency") +ggtitle("Did you simulate a full Step 2 test day experience, e.g. 8 blocks \n of 40 UWorld questions or 2 of the 4-section practice tests?", subtitle=paste(" answered = ",sum(!is.na(step2_complete$full_practice_test)), " missing = ", sum(is.na(step2_complete$full_practice_test))))```Full day practice test for Step 1 (as added by Megan)```{r}# 5b full test day as factor from "Did you simulate a full Step 1 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = simulate_full_practice_1), data= dt_q3) +coord_flip() +scale_color_brewer(palette ="Greens") +geom_bar() +theme_minimal() +xlab("") +ylab("Frequency") +ggtitle("Did you simulate a full Step 1 test day experience, e.g. 8 blocks \n of 40 UWorld questions or 2 of the 4-section practice tests?", subtitle=paste(" answered = ",sum(!is.na(dt_q3$full_practice_test))," missing = ",sum(is.na(dt_q3$full_practice_test))))```Step 2 scores```{r}# 6 histogram of Step 2 scoresggplot(step2_complete, aes(x = score_step2)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Step 2 Score") +ylab("Count") +ggtitle("Frequency of reported Step 2 Scores", subtitle=paste(" answered = ",sum(!is.na(step2_complete$score_step2)), " missing = ", sum(is.na(step2_complete$score_step2)), "since this was an analysis inclusion condition"))```Step 1 scores```{r}# 7 frequency table of Step 1 scoreskable(table(step1_complete$score_step1), caption ="Frequency of passing Step 1") # would be nice to have a way to summarize these in the same table, but this can just be remade with a table in Wordkable(table(is.na(step1_complete$score_step1)), caption ="True indicates missing values for Step 1 score")```Resources use```{r}# 8 what resources are most widely used barplot resources_2 <- step2_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))resources_1 <- step1_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))rdf <-data.frame(amount =c(t(resources_1), t(resources_2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(c("Uworld", "First Aid", "Anki", "Sketchy", "Amboss", "Pathoma", "Boards and Beyond", "Other"), times =2))res_freq2 <- resources_2/dim(step2_complete)[1]res_freq1 <- resources_1/dim(step1_complete)[1]rdf_freq <-data.frame(amount =c(t(res_freq1), t(res_freq2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(c("Uworld", "First Aid", "Anki", "Sketchy", "Amboss", "Pathoma", "Boards and Beyond", "Other"), times =2))# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization used this as a guide#what resources are most widely used? I want to sort this barplot in order of frequency but for some reason this wasn't working for me with reorder()ggplot(data=rdf, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge()) +coord_flip() +xlab("") +ylab("Count") +scale_fill_brewer(palette="Paired", name ="") +theme_minimal() +ggtitle("Which resources are most widely used by Step 1/2?") ggplot(data=rdf_freq, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge()) +coord_flip() +xlab("") +ylab("Proportion") +scale_fill_brewer(palette="Paired", name ="") +theme_minimal() +ggtitle("Which resources are most widely used by Step 1/2?") # 9 practice score step 2 (3 digit score)ggplot(step2_complete, aes(x = practice_score_step2)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 2 Practice Score") +ylab("Count") +ggtitle("What was your 3 digit score on the final practice test \n you took before the Step 2 exam?", subtitle=paste(" answered = ", sum(!is.na(step2_complete$practice_score_step2)), " missing = ", sum(is.na(step2_complete$practice_score_step2))))```Summarizing comments for other_resources, other_step, changes_step```{r}# 10 summarize comments for other_resources, other_step, changes_step## summarize comments (other_resources, other_step, changes_step)#"other_resources_step1" "other_step1" "changes_step1" kable(table(step1_complete$other_resources_step1), caption ="Other listed resources for Step 1")#If you used other resources not listed above, would you use them again?step1_complete$other_hascontents <-as.integer(nchar(step1_complete$other_resources_step1) >0) #flagging the rows that have contents https://stackoverflow.com/questions/64744988/testing-to-see-if-characters-are-present-in-a-cell-in-rmystring1 <- step1_complete %>%filter(other_hascontents ==TRUE) %>%select(other_resources_step1, other_step1) kable(mystring1, caption ="Other resources and if you'd used them again")```What would you change?```{r}step1_complete$change_hascontents <-as.integer(nchar(step1_complete$changes_step1) >0) #as.data.frame(step1_complete[which(step1_complete$change_hascontents == 1), "changes_step1" ])mystring2 <- step1_complete %>%filter(change_hascontents ==TRUE) %>%select(changes_step1) %>%toString() cat(str_wrap(mystring2), "\n") #used this package to help it print nicer https://stringr.tidyverse.org/reference/str_wrap.html```Practice Scores for Step 1 and Step 2```{r}# 11 Practice Scores for Step 1 and Step 2# 11a final_nbme_practice_score_step1 is a percent from 0 to 100ggplot(step1_complete, aes(x = final_nbme_practice_score_step1)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 1 Probability of Passing") +ylab("Count") +ggtitle("What was the probability of passing you received \n on your final NBME form before the Step 1 exam?", subtitle=paste(" answered = ", sum(!is.na(step1_complete$final_nbme_practice_score_step1)), " missing = ", sum(is.na(step1_complete$final_nbme_practice_score_step1))))# 11b final_uw_practice_score_step1 is a 3 digit UWorld score for step 1ggplot(step1_complete, aes(x = final_uw_practice_score_step1)) +geom_histogram(bins =30) +theme_minimal() +scale_color_brewer(palette ="Greens") +xlab("Final Step 1 UWorld Score") +ylab("Count") +ggtitle("What was the 3 digit score you received on your final \n UWorld form before the Step 1 exam?", subtitle=paste(" answered = ", sum(!is.na(step1_complete$final_uw_practice_score_step1)), " missing = ", sum(is.na(step1_complete$final_uw_practice_score_step1))))```:::# Appendix/notesAll the analyses are performed using the following: - R version 4.2.2 (2022-06-24); R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.- Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, <https://CRAN.R-project.org/package=rms>.The table below lists packages used in this document.```{r}subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion))```